Abstract

The performance of large neural networks can be judged not only by their
storage capacity but also by the time required for learning. A polynomial
learning algorithm with learning time $\sim N^2$ in a network with $N$ units
might be practical, whereas a learning time $\sim e^N$ would allow only rather
small networks. The question of the absolute storage capacity $\alpha_c$ and
the capacity for polynomial learning rules $\alpha_p$ is discussed for several
feed-forward architectures: the perceptron, the binary perceptron, the
committee machine, and a perceptron with fixed weights in the first layer and
adaptive weights in the second layer. The analysis is based partially on
dynamic mean field theory, which is valid for $N \to \infty$. For the committee
machine in particular, a value of $\alpha_p$ considerably lower than the
capacity predicted by replica theory or simulations is found. This discrepancy
is resolved by new simulations investigating the learning time dependence and
revealing subtleties in the definition of the capacity.

1 Introduction 

Given some neural network architecture and some set of training examples, the
question is not only whether the network is able to learn the set, but also how
many training cycles are required to do so. Especially for large networks, the
requirement of reasonably fast learning algorithms can pose decisive
restrictions. To be more precise, for a network with $N$ nodes a polynomial
learning algorithm, with learning time growing as $N^2$, might be acceptable
for rather large $N$, whereas an exponential behavior $\sim e^N$ would tolerate
small networks only. This question is analyzed in the following for the task of
a random classification (dichotomy) of a random set of $P$ input patterns in
various feed-forward networks with $N$ input nodes and a single threshold
output unit. The quantities of interest are the maximal storage capacity
$\alpha_c = P_{\max}/N$ and the maximal polynomial capacity $\alpha_p = P_p/N$,
with $P_p$ being the maximal size of the training set which can be learned in
polynomial time.
Methods of statistical physics have turned out to be quite useful for the
investigation of various properties of large neural networks. One of the most
prominent examples is the computation of the storage capacity of a simple
perceptron without hidden units by E. Gardner (1988). For unbiased examples
$\alpha_c = 2$ is found for $N \to \infty$. Learning can be done by the
perceptron learning rule proposed by Rosenblatt (1962), which determines
suitable couplings for $\alpha < \alpha_c$ after a finite number of learning
cycles. Each learning cycle consists of the presentation of each pattern once,
and the adjustment of the couplings for each pattern requires $\sim N$
computations. The total learning algorithm is therefore polynomial $\sim N^2$
and $\alpha_p = \alpha_c$. An efficient way to implement such a rule is the
adatron described by Anlauf and Biehl (1990).
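
As an illustration of the cycle counting used above, the plain perceptron
learning rule can be sketched as follows. This is a minimal sketch and not the
adatron; the function names, the network size and the loading are assumptions
chosen for the example, and the desired outputs are taken to be absorbed into
the inputs so that every stimulus has to become positive.

```python
import random

def train_perceptron(patterns, max_cycles=1000):
    """Perceptron rule for the task 'every stimulus > 0'.

    Returns (weights, cycles) on success, or None if max_cycles is exhausted.
    """
    n = len(patterns[0])
    weights = [0.0] * n
    for cycle in range(1, max_cycles + 1):
        errors = 0
        for xi in patterns:  # one cycle = every pattern presented once
            stimulus = sum(w * x for w, x in zip(weights, xi))
            if stimulus <= 0:  # misclassified: correcting costs ~N operations
                weights = [w + x for w, x in zip(weights, xi)]
                errors += 1
        if errors == 0:  # a full cycle without corrections: error-free
            return weights, cycle
    return None

def random_patterns(p, n, rng):
    """P random input patterns with entries +1 or -1 (equal probability)."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]

rng = random.Random(0)
N, ALPHA = 40, 0.5  # illustrative values only
patterns = random_patterns(int(ALPHA * N), N, rng)
result = train_perceptron(patterns)
if result is not None:
    weights, cycles = result
    print(f"alpha = {ALPHA}: error-free after {cycles} cycles")
```

Each cycle touches $P = \alpha N$ patterns at a cost of order $N$ each, which
is the origin of the $\sim N^2$ scaling quoted above.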

The question of maximal storage capacity and learning has also been addressed
for more complicated networks, for instance a perceptron with binary weights
$W_i = \pm 1$ (Krauth and Mezard (1989), Horner (1992), Horner (1993)) and for
perceptrons with hidden layers (Barkai et al. (1992), Engel et al. (1992),
Priel et al. (1994)). For those architectures $\alpha_p < \alpha_c$ is
expected. For the maximal storage capacity, following Gardner's approach
(Gardner (1988)), the volume in the space of synaptic weights $W_i$ compatible
with the learning task is computed within replica theory. The maximal storage
capacity is reached when this volume shrinks to zero. Depending on $\alpha$, a
solution with broken replica symmetry can be found, indicating the
decomposition of the available phase space into disjoint ergodic components.
Alternatively, within the framework of dynamic mean field theory, a stochastic
motion of the weights $W_i(t)$ in the available part of the phase space is
investigated (Horner (1992)). Depending on $\alpha$, diverging time scales can
appear. This again indicates ergodicity breaking and decomposition of the phase
space into disjoint ergodic components. The critical values of $\alpha$ need
not, however, be the same for the two approaches. Both schemes can be
generalized to finite temperature (noise), and for slowly decreasing
temperature the above process corresponds to learning by simulated annealing.
For a finite number of temperature steps this procedure is polynomial
$\sim N^2$.

In the following we discuss the question of learning in polynomial time for 
several examples. We start with the simple perceptron, briefly describe the ideas 
behind the dynamic mean field analysis and turn to the perceptron with binary 
weights. This is followed by a discussion of perceptrons with one layer of hidden 
units. In the committee machine learning is done for the weights connecting the 
input nodes with the hidden units and fixed connections from the hidden units to 
the output node. The number of hidden units in this case is assumed to be finite. 
Another architecture (coding machine) proposed by Bethge et al. (1994) has fixed 
connections in the first layer which map the input onto a large layer of hidden 
units. Learning is done with the weights connecting the hidden units to the output 
unit. 

We focus exclusively on learning a set of unbiased random patterns leaving aside 
the most interesting question of generalization ability for patterns constructed by 
some other rule. 
2 The Perceptron



The perceptron can be viewed as the elementary building block of any neural
network. It has $N$ input nodes and a single output node connected by synaptic
weights $W_i$. The set of patterns $\mu = 1 \dots P$ is characterized by its
inputs $\xi_i^\mu = \pm 1$ and the desired outputs $\pm 1$, both chosen
randomly with equal probability. On presentation of pattern $\mu$, the output
unit receives a stimulus

$$h^\mu = \frac{1}{\sqrt{N}} \sum_i W_i \xi_i^\mu. \tag{1}$$

The learning task is to find weights such that $\mathrm{sign}(h^\mu)$ equals
the desired output for all patterns. Without loss of generality we may choose
the desired output to be $+1$ for all patterns and consider the more stringent
task $h^\mu > \kappa > 0$ together with the constraint $\sum_i W_i^2 = N$. The
resulting maximal storage capacity $\alpha_c(\kappa)$ for $N \to \infty$ is
(Gardner (1988))

$$\frac{1}{\alpha_c(\kappa)} = \int_{-\infty}^{\kappa} \frac{dh}{\sqrt{2\pi}} \, e^{-h^2/2} \, (\kappa - h)^2 \tag{2}$$

and $\alpha_c(0) = 2$. At maximal loading, the probability distribution of the
stimuli is

$$P(h) = \langle \delta(h - h^\mu) \rangle = \frac{1}{2} \left[ 1 + \mathrm{erf}\!\left( \frac{\kappa}{\sqrt{2}} \right) \right] \delta(h - \kappa) + \frac{1}{\sqrt{2\pi}} \, e^{-h^2/2} \, \theta(h - \kappa). \tag{3}$$

We are going to use this result later on.
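
The capacity formula can be evaluated numerically. The sketch below assumes
Gardner's result in the standard form
$1/\alpha_c(\kappa) = \int_{-\infty}^{\kappa} \frac{dh}{\sqrt{2\pi}} e^{-h^2/2} (\kappa - h)^2$
and uses the closed form obtained by carrying out the Gaussian integral; a
midpoint-rule quadrature serves as an independent check. The function names
and the integration cutoff are assumptions made for this example.

```python
import math

def gardner_capacity(kappa):
    """Closed form of alpha_c(kappa) from the Gaussian integral."""
    # Gaussian weight of the interval (-inf, kappa]
    cdf = 0.5 * (1.0 + math.erf(kappa / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * kappa * kappa) / math.sqrt(2.0 * math.pi)
    return 1.0 / ((1.0 + kappa * kappa) * cdf + kappa * pdf)

def gardner_capacity_numeric(kappa, lower=-12.0, steps=200_000):
    """Midpoint-rule evaluation of the same integral, truncated at `lower`."""
    width = (kappa - lower) / steps
    total = 0.0
    for i in range(steps):
        h = lower + (i + 0.5) * width
        total += math.exp(-0.5 * h * h) * (kappa - h) ** 2
    return math.sqrt(2.0 * math.pi) / (total * width)

for kappa in (0.0, 0.5, 1.0):
    print(kappa, gardner_capacity(kappa), gardner_capacity_numeric(kappa))
```

For $\kappa = 0$ the closed form reduces to $\alpha_c = 2$, and the capacity
decreases as the required margin $\kappa$ grows.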

As learning rule (for $\kappa = 0$) we have investigated a slightly modified
adatron (Anlauf and Biehl (1990)), where upon presentation of pattern $\mu$ the
weights are modified according to

$$\Delta_\mu W_i = \gamma(t) \, \bigl( \kappa(t) - h^\mu \bigr) \, \theta\bigl( \kappa(t) - h^\mu \bigr) \, \frac{\xi_i^\mu}{\sqrt{N}} - \eta W_i. \tag{4}$$

The learning time $t$ counts the number of learning cycles. The parameters are
allowed to change during the learning process such that $\gamma(t) \to 1$ and
$\kappa(t) \to 0$ for $t \to \infty$. The last term ensures the normalization
$\langle W_i^2 \rangle = 1$ of the weights. Simulations yield for the median
$t_{\mathrm{med}}$ of the learning time (the time required for error-free
learning in 50% of the training sets investigated)

$$t_{\mathrm{med}} \approx \frac{5.5}{2 - \alpha}. \tag{5}$$

Since each cycle requires $\sim N^2$ computations, this algorithm is polynomial.
3 Dynamic mean field theory



Dynamic mean field theory was originally introduced by Sompolinsky and
Zippelius (1982) as an alternative to the replica theory of spin glasses. It
was applied to learning in perceptrons with binary weights by Horner (1992) and
to perceptrons with hidden units by Bethge (1997).

Applying dynamic mean field theory to learning in neural networks, one defines
some cost function depending on the set of patterns and the weights. It
corresponds to the energy in physical systems, and it is chosen to be zero for
weights such that all patterns are classified without errors, and positive
otherwise. The weights $W_i(t)$ are considered as dynamic variables following
some stochastic equation of motion: a Langevin equation for continuous weights
or a master equation for discrete weights. Both equations allow for finite
temperatures corresponding to learning with noise. Learning by simulated
annealing is such a process where the temperature is slowly reduced. One unit
of time corresponds to the presentation of $\alpha N$ patterns and therefore
requires $\sim N^2$ computations. This means that learning $P = \alpha N$
patterns in finite time yields a polynomial $\sim N^2$ algorithm.

In the limit $N \to \infty$ a mean field approximation becomes exact. The order
parameters of this theory are correlation functions

$$Q(t_1, t_2) = \frac{1}{N} \sum_i \langle W_i(t_1) W_i(t_2) \rangle, \tag{6}$$

and corresponding response functions, and in addition similar functions for the
stimuli $h^\mu(t)$. The resulting mean field equations are coupled nonlinear
integro-differential equations. In order to follow learning one would have to
solve these equations with some initial condition, for instance randomly chosen
weights.

Assuming, however, that the system has reached equilibrium, the order parameter
functions depend on $t = t_1 - t_2$ only and the theory simplifies
considerably. The quantity $1 - Q(t)$ is a measure of the portion of phase
space explored in time $t$ (note $Q(0) = 1$).

If the system equilibrates in finite time, which is the case at high
temperatures, $Q(t)$ reaches some asymptotic value $Q_a$, and $1 - Q_a$ is a
measure of the size of the total accessible part of phase space.
Asymptotically, $Q(t) - Q_a$ is found to decay as a power law in $t$ cut off by
an exponential factor $e^{-t/t_0(T)}$. At some lower temperature $T = T_f$ a
freezing transition is possible with $t_0(T_f) \to \infty$. For $T < T_f$ one
finds a pure power-law decay of $Q(t) - Q_c$ with $Q_c > Q_a$. This means that
the accessible part of phase space can no longer be explored in finite time and
the system is no longer ergodic.

Depending on the behavior of $Q_c(T) - Q_a(T)$ for $T \to T_f$ one
distinguishes continuous transitions, if $Q_c(T) - Q_a(T) \to 0$, and
discontinuous transitions otherwise. For a system undergoing a discontinuous
transition, $Q(t)$ shows a plateau near $Q_c(T_f)$ already above the
transition. This means that the system, for times shorter than some $t_c(T)$,
explores primarily one of the ergodic components which strictly forms only
below the transition. The whole scenario is sketched in Fig. 1.
Fig. 1 Order parameter $Q(t)$ slightly above and below the freezing transition.
Top row: Discontinuous transition. Bottom row: Continuous transition.



The appearance of diverging time scales for $T < T_f$ is also expected for
nonequilibrium initial conditions appropriate for the question of learning, and
one can therefore conclude that optimal learning is not possible if $T_f > 0$.
As will be discussed later, this is the case for all values of $\alpha$ for the
binary perceptron, and for $\alpha > \alpha_p$ for the committee machine.



4 The binary perceptron 

The weights of this network are restricted to $W_i = \pm 1$, and learning
according to Eq. (4) is no longer possible. For small networks an exact
enumeration of all values is possible (Krauth and Opper (1989)). Extrapolating
to $N \to \infty$, the value $\alpha_c \approx 0.833$ derived within replica
theory (Krauth and Mezard (1989)) is found. This requires, however, $\sim 2^N$
computations.
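
The exponential cost of exact enumeration is easy to see in a small example.
The following is a minimal sketch of a brute-force search over all $2^N$ binary
weight vectors, not the procedure of the cited work; the function names, the
size $N = 9$ and the number of samples are assumptions chosen so that the
example runs quickly.

```python
import itertools
import random

def has_binary_solution(patterns):
    """Check all 2^N weight vectors W_i = +-1 for one with every stimulus > 0."""
    n = len(patterns[0])
    for weights in itertools.product((-1, 1), repeat=n):
        if all(sum(w * x for w, x in zip(weights, xi)) > 0 for xi in patterns):
            return True
    return False

def storable_fraction(n, p, samples, rng):
    """Fraction of random pattern sets of size P that admit a binary solution."""
    hits = 0
    for _ in range(samples):
        patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
        hits += has_binary_solution(patterns)
    return hits / samples

rng = random.Random(0)
N = 9  # odd, so that a stimulus can never be exactly zero
for p in (3, 6, 9):
    print(f"P = {p}: fraction storable = {storable_fraction(N, p, 20, rng):.2f}")
```

Every additional input node doubles the number of weight vectors to be tested,
so this approach is limited to small $N$.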




Fig. 2 Distribution $P(h)$ for random weights, perceptron and clipped
perceptron.

Fig. 3 Rate of errors from replica theory (RSB), clipped perceptron
(clip.perc) and dynamics at $T_f$ (dynamic).
Polynomial $\sim N^2$ learning can be done by training according to Eq. (4)
assuming continuous weights $w_i^{(\mathrm{perc})}$ and choosing
$W_i = \mathrm{sign}(w_i^{(\mathrm{perc})})$. In order to estimate the results
we can write $W_i = w_i^{(\mathrm{perc})} + \Delta W_i$ and modify the
normalization of the $w_i^{(\mathrm{perc})}$ such that
$\langle \Delta W_i \rangle = 0$. This yields
$\langle \Delta W_i^2 \rangle \approx 0.4$, and the distribution of the stimuli
is obtained by convolution of the perceptron result, Eq. (3), with a Gaussian
of width $\langle \Delta W_i^2 \rangle$. The resulting distribution for
$\alpha = 0.52$ is shown in Fig. 2, and for all values of $\alpha$ there is a
tail extending to $h < 0$. This means that part of the patterns are not
classified correctly. The resulting error rate is shown in Fig. 3.

Dynamic mean field theory has been applied to this problem by Horner (1992,
1993). A discontinuous ergodicity breaking transition is found for all
$\alpha$, and the resulting freezing temperature $T_f(\alpha)$ is shown in
Fig. 4. Simulations with restricted learning time show that very little
improvement of the error rate is achieved if the system is cooled below the
freezing temperature, and therefore the error rate at $T_f$ is a reasonable
measure of the performance of an $N^2$ algorithm. As can be seen from Fig. 3,
the performance is superior to that of the clipped perceptron, but a finite
fraction of errors remains at all values of $\alpha$.




Fig. 4 Dynamic freezing temperature $T_f(\alpha)$ and transition temperature
$T_{S=0}$ derived from replica theory (Krauth and Mezard (1989)).



The capacity $\alpha(N, t_L)$ for samples of various size $N$ and total
learning time $t_L$ has been evaluated by simulated annealing (Horner (1993)).
The results shown in Fig. 5 indicate that the full capacity can be reached for
$t_L \sim e^N$ only.
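
Simulated annealing with a finite number of temperature steps can be sketched
as below. This is a minimal illustration under stated assumptions, not the
procedure of the cited simulations: the cost is taken to be the number of
misclassified patterns, moves are single weight flips accepted by the
Metropolis rule, and the function names, temperature schedule and sizes are
hypothetical.

```python
import math
import random

def count_errors(weights, patterns):
    """Cost function assumed here: number of patterns with stimulus <= 0."""
    return sum(1 for xi in patterns
               if sum(w * x for w, x in zip(weights, xi)) <= 0)

def anneal_binary_perceptron(patterns, temperatures, sweeps, rng):
    """Metropolis single-flip dynamics for W_i = +-1 at decreasing temperatures."""
    n = len(patterns[0])
    weights = [rng.choice((-1, 1)) for _ in range(n)]
    energy = count_errors(weights, patterns)
    best_weights, best_energy = list(weights), energy
    for temperature in temperatures:  # all temperatures must be positive
        for _ in range(sweeps * n):
            i = rng.randrange(n)
            weights[i] = -weights[i]  # propose flipping one weight
            new_energy = count_errors(weights, patterns)
            delta = new_energy - energy
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                energy = new_energy  # accept the move
                if energy < best_energy:
                    best_weights, best_energy = list(weights), energy
            else:
                weights[i] = -weights[i]  # reject: undo the flip
    return best_weights, best_energy

rng = random.Random(0)
N, P = 21, 8  # illustrative sizes only
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
weights, errors = anneal_binary_perceptron(patterns, [2.0, 1.0, 0.5, 0.25], 10, rng)
print(f"remaining errors: {errors} of {P}")
```

For simplicity the cost is recomputed from scratch after every flip; keeping
the stimuli in memory and updating them incrementally would reduce the cost of
one proposed move to order $P$.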
5 The committee machine (tree structure)



The committee machine with nonoverlapping receptive fields (tree structure) has
$N = KM$ input nodes, $K$ hidden units and a single output unit. Learning is
done with the weights $W_{ij}$ connecting the input nodes with the $K$ hidden
units, whereas the weights connecting the hidden units with the output node are
fixed to $1$ (see Fig. 6).




Fig. 6 Committee machine with nonoverlapping receptive fields, $M = 5$ and
$K = 3$.



Presenting pattern $\mu$, the hidden units receive a stimulus $h_i^\mu$,
Eq. (1).

With all desired outputs equal to $1$, the learning task is now to determine
the weights such that

$$\sum_i \mathrm{sign}(h_i^\mu) > 0. \tag{8}$$

A possible learning procedure is the following: presenting pattern $\mu$, for
example in a network with $K = 3$, one of the possibilities given below shows
up during learning,

sign(h_i^mu)    prob.    learning

+ + +           1/8      none
+ + -           3/8      none
+ - -           3/8      1 of 2
- - -           1/8      2 of 3

where equal probability for each possibility is assumed for simplicity. On
average $\frac{3}{8} \cdot 1 + \frac{1}{8} \cdot 2 = \frac{5}{8}$ subperceptrons
are corrected per pattern, which means that the weights of each subperceptron
are updated $\frac{5}{24} \alpha N$ times per learning cycle. Since the
perceptron learning rule allows for each of the 3 subperceptrons a maximal
number of $M$ updates at maximal capacity, we arrive at the estimate
$\alpha_p(3) = \frac{8}{5} = 1.6$. This is below the capacity of a simple
perceptron with $N$ inputs, and for larger $K$ even lower values are found, for
instance $\alpha_p(5) = 1.39$ or $\alpha_p(7) = 1.25$.
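
The counting argument above can be written out for general odd $K$. The sketch
below assumes, as in the text, that all $2^K$ sign configurations of the hidden
units are equally likely, that just enough negative subperceptrons are
corrected to restore a positive majority, and that each subperceptron tolerates
$M = N/K$ updates per cycle; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def polynomial_capacity_estimate(k):
    """Counting estimate of alpha_p(K) for an odd number K of hidden units."""
    if k % 2 == 0:
        raise ValueError("K must be odd")
    tolerated = (k - 1) // 2  # negative hidden units compatible with a positive sum
    # With n > tolerated negative units, n - tolerated of them must be corrected.
    weighted = sum(comb(k, n) * (n - tolerated) for n in range(tolerated + 1, k + 1))
    corrections_per_pattern = Fraction(weighted, 2 ** k)
    # Each subperceptron gets a share 1/K, i.e. corrections / K * alpha * N updates
    # per cycle; equating this with M = N / K gives alpha = 1 / corrections.
    return 1 / corrections_per_pattern

for k in (3, 5, 7):
    print(k, f"{float(polynomial_capacity_estimate(k)):.3f}")
# prints 1.600, 1.391 and 1.255 for K = 3, 5 and 7
```

The exact fractions are $8/5$, $32/23$ and $64/51$, which reproduce the values
1.6, 1.39 and 1.25 quoted above.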

In general, learning for the committee machine can be viewed as a two-step
process: i) selecting the subperceptrons to be modified, ii) modifying the
weights of the selected subperceptrons. The second step is polynomial
$\sim N^2$. It might turn out that the initial restricted random choice of the
subperceptrons used to embed a given pattern is not optimal and that better
distributions of the learning load exist. Testing these possibilities, however,
is a combinatorial problem requiring of the order of $P!$ computations.

A polynomial learning algorithm therefore has to be local in the sense that the
decision which of the subperceptrons to select for training has to be made
instantaneously as learning goes on. In the following we analyze a modified
form of the least action algorithm proposed by Nilsson (1965). Among the
candidates for learning, that is, among the subperceptrons with $h_i^\mu < 0$,
we select those with the smallest value of $|h_i^\mu|$. This is done by
introducing a cost function $E(\{h_i\})$ which is zero if
$\sum_i \mathrm{sign}(h_i) > 0$ and monotonically increasing otherwise. For
$K = 3$

$$E(\{h_i\}) = -h_1 \theta(-h_1) \bigl( 1 - \theta(h_2 - h_1) \, \theta(h_3 - h_1) \bigr) + \text{permutations} \tag{9}$$

is appropriate.
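
A cost function of this kind can be written down directly for $K = 3$. The
sketch below assumes the form
$E = -h_1 \theta(-h_1) (1 - \theta(h_2 - h_1) \theta(h_3 - h_1))$ plus the two
cyclic permutations, so that every negative stimulus contributes its magnitude
unless it is the smallest of the three. The function names and the convention
$\theta(0) = 0$ are choices made for this example.

```python
def theta(x):
    """Heaviside step function, taken as 0 at x = 0 (a convention chosen here)."""
    return 1.0 if x > 0 else 0.0

def least_action_cost(h):
    """Cost for K = 3: sum of |h_i| over negative h_i that are not the smallest."""
    if len(h) != 3:
        raise ValueError("this sketch covers K = 3 only")
    total = 0.0
    for i in range(3):
        j, k = (i + 1) % 3, (i + 2) % 3
        is_smallest = theta(h[j] - h[i]) * theta(h[k] - h[i])
        total += -h[i] * theta(-h[i]) * (1.0 - is_smallest)
    return total

print(least_action_cost([0.7, 1.2, 0.3]))     # majority positive: no cost
print(least_action_cost([-1.0, 2.0, 3.0]))    # one negative unit: still no cost
print(least_action_cost([-1.0, -0.5, 2.0]))   # two negative: the one closer to zero counts
print(least_action_cost([-3.0, -1.0, -2.0]))  # three negative: the two closest to zero count
```

Only the subperceptrons that need the smallest change enter the cost, so a
gradient step on this function corrects exactly the candidates selected by the
least action prescription.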

With this cost function it is also possible to investigate simulated annealing
and to perform the corresponding analysis based on dynamic mean field theory,
which has been done by Bethge (1997). Her calculation shows that a continuous
ergodicity breaking transition exists for $\alpha > \alpha_p(K)$ with
$\alpha_p(3) \approx 1.75$. The same value was found by Barkai et al. (1992)
and by Engel et al. (1992) for the onset of replica symmetry breaking. This
agreement is actually expected for a continuous transition, in contrast to a
discontinuous transition, as pointed out in the previous section. Applying the
arguments given in Sect. 3, we have to conclude that error-free polynomial
learning is not possible for $\alpha > \alpha_p(K)$.

This appears to be in clear contradiction to the results of simulations
reported by Barkai et al. (1992), Engel et al. (1992) and Priel et al. (1994),
where for $K = 3$ values of $\alpha_c$ between 2 and 2.75 were obtained. Up to
6000 learning cycles were used, a value much larger than what would be expected
from Eq. (5) assuming that the learning time is ruled by the perceptron
learning part of the algorithm. The apparent improvement also cannot be due to
a better handling of the combinatorial part where the subperceptrons to be
trained are selected. This would result in a strong size dependence, which was
not observed. In these investigations successful learning of all patterns with
probability 1/2 was used as the criterion for $\alpha_c$. Priel et al. (1994)
also determined the median of the time required for perfect learning as a
function of $\alpha$. They found, for instance, for $K = 3$ and $\alpha = 2$ a
median of about 25 and a divergence around $\alpha_c \approx 2.75$.

In order to understand this discrepancy we have performed preliminary
simulations on a committee machine with $N = 150$ and $K = 3$ using the
modified least action algorithm described above together with the perceptron
learning rule Eq. (4). Fig. 7 shows the probability $R(\alpha, t_L)$ that after
time $t_L$, that is, after presentation of $t_L \alpha N$ patterns, error-free
learning is not yet reached. For comparison the learning curve for a simple
perceptron is also shown. The results were obtained for a single set of
patterns, and in each of the 1000 runs used for each value of $\alpha$ the
patterns were presented in a different random order. For different sets of
patterns similar curves are obtained, but we have not tried to average over
different sets or to investigate finite size effects. Judging from the
simulations performed by Priel et al. (1994), $N = 150$ seems to be sufficient
to eliminate drastic finite size effects.




Fig. 7 Fraction of incomplete learning as function of the learning time $t_L$
for a committee machine with $N = 150$, $K = 3$ and various values of $\alpha$.



There appears to be a qualitative difference between the learning curves for
$\alpha < \alpha_p = 1.75$ and $\alpha > \alpha_p$, respectively. The fraction
$R(\alpha, t_L)$ decays rapidly to zero for $\alpha < \alpha_p$, whereas a much
slower decay $\sim t_L^{-1}$ or even slower is observed otherwise. For all
values of $\alpha$ investigated, a finite median $t_{\mathrm{med}}$ of the
learning time defined by $R(\alpha, t_{\mathrm{med}}) = \frac{1}{2}$ exists.
The average learning time is computed from the probability
$P(\alpha, t_L) = -dR(\alpha, t_L)/dt_L$, resulting in

$$\langle t_L \rangle = \int_0^\infty dt \, R(\alpha, t). \tag{10}$$

The simulations indicate that for $\alpha > \alpha_p$ this integral does not
exist, and furthermore that for any finite $t_L$ the probability for error-free
polynomial learning is less than one.
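
The difference between a finite median and a divergent mean can be made
concrete with a discrete-time version of Eq. (10), in which the mean learning
time is the sum of $R(t)$ over $t = 0, 1, 2, \dots$ The sketch below uses a
short list of hypothetical learning times and an illustrative tail
$R(t) = 1/(1 + t)$; neither is simulation data, and the function names are
assumptions.

```python
def survival(times, t):
    """Empirical R(t): fraction of runs not yet error-free after time t."""
    return sum(1 for x in times if x > t) / len(times)

def median_time(times):
    """Smallest integer t with R(t) <= 1/2."""
    t = 0
    while survival(times, t) > 0.5:
        t += 1
    return t

def mean_time_from_survival(times):
    """Sum of R(t) over t = 0, 1, ...; equals the mean for integer times."""
    return sum(survival(times, t) for t in range(max(times)))

def truncated_mean_power_law(t_max):
    """Sum of R(t) = 1/(1 + t) for t < t_max; grows like log(t_max)."""
    return sum(1.0 / (1.0 + t) for t in range(t_max))

times = [1, 2, 2, 5, 10]  # hypothetical learning times in cycles
print(median_time(times), mean_time_from_survival(times))
for t_max in (10**2, 10**4, 10**6):
    print(t_max, truncated_mean_power_law(t_max))
```

For the tail $R(t) = 1/(1 + t)$ the median is reached at $t = 1$, while the
truncated sum keeps growing with the cutoff, which is the behavior described
above for a decay $\sim t_L^{-1}$ or slower.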

This result certainly has to be reconfirmed by additional simulations.
Nevertheless it reveals new subtleties in the determination or even the
definition of the storage capacity of neural networks. If, for instance, a
certain fraction of successful learning is used to determine the storage
capacity, different values may result even in the limit $N \to \infty$,
depending on which fraction is required. There could also be another critical
value $\alpha_c$ separating a region where $R(\alpha, t_L) \to 0$ for
$t_L \to \infty$ from a region where this limit is finite. The results of the
simulation also allow for speculations on the fractal structure of the
accessible part of phase space being composed of almost disconnected
subregions.

6 Perceptron with divergent preprocessing 

The last architecture to be discussed shows a possibility to increase the
storage capacity beyond the capacity of a simple perceptron while retaining a
polynomial learning rule. The architecture of this network (coding machine) is
shown in Fig. 8. The input layer is divided into $L$ nonoverlapping receptive
fields of size $M = N/L$. Each receptive field has its own part of size
$K > M$ of the hidden layer. The fixed weights connecting input and hidden
layer are supposed to establish a one-to-one mapping of the input patterns
$\xi_i^\mu = \pm 1$ onto internal representations with values $\pm 1$, assuming
binary inputs and threshold hidden units. For example, each of the $2^M$
different inputs at any of the receptive fields can be mapped for $K = 2^M$
onto an internal representation with a single active node in each part of the
hidden layer, generating a sparse coding internal representation. This mapping
could be achieved by unsupervised learning of the winner-takes-all type. For
$K \ll K_{\max} = 2^M$ randomly chosen first-layer weights would also be
possible. If the set of patterns has some structure reflected in the
subdivision into receptive fields, other unsupervised learning procedures are
possible. The complete hidden layer is finally connected to an output unit, and
the corresponding weights are determined by supervised perceptron learning.
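
The sparse coding for $K = 2^M$ can be sketched as a direct lookup. This is a
minimal illustration in which the fixed first layer is replaced by an explicit
index computation rather than by threshold units with learned weights; the
function names and the choice of $-1$ for inactive nodes are assumptions.

```python
import itertools

def sparse_code(field):
    """Map M inputs (+1 or -1) onto 2^M hidden units with one active (+1) node."""
    index = 0
    for x in field:  # read the inputs as the bits of an integer
        index = 2 * index + (1 if x > 0 else 0)
    hidden = [-1] * (2 ** len(field))
    hidden[index] = 1
    return hidden

def internal_representation(pattern, m):
    """Split the N inputs into L = N / M receptive fields and code each one."""
    if len(pattern) % m != 0:
        raise ValueError("pattern length must be a multiple of M")
    hidden = []
    for start in range(0, len(pattern), m):
        hidden.extend(sparse_code(pattern[start:start + m]))
    return hidden

pattern = [1, -1, -1, -1, 1, 1, -1, 1]  # N = 8 inputs, L = 4 fields of size M = 2
print(internal_representation(pattern, 2))

# The mapping is one-to-one on each receptive field:
codes = {tuple(sparse_code(f)) for f in itertools.product((-1, 1), repeat=2)}
print(len(codes))  # prints 4
```

The resulting hidden layer has $L \cdot 2^M$ units, and a perceptron rule on
these units then plays the role of the supervised second layer.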

The whole architecture can also be viewed as a prototype for data processing 
in the brain, having in mind for instance the mapping of a comparatively small 
number of neurons in the retina onto a much larger number of neurons in the 
primary visual cortex. 




Fig. 8 Coding machine with $L = 4$ receptive fields of size $M = 2$ and
$K = 2^M$ hidden units for each receptive field.



Even for uncorrelated random input patterns the resulting internal
representations are correlated. Neglecting these correlations for a moment, the
second layer can be viewed as a simple perceptron of size $LK$, and the
capacity is $\alpha_p = \alpha_c = 2K/M$. Since learning in the second layer
can be done with the perceptron algorithm, it is polynomial
$\sim (\alpha_c M)^2$. The effect of correlations within the internal
representations has been examined for $K = 2^M$ by Bethge et al. (1994) with
the result

$$\alpha_c = \frac{2}{M} \left( 2^M - 1 \right). \tag{11}$$

This shows that the correlations indeed give rise to negligible corrections,
and the coding machine is a useful and fast learning architecture if capacities
beyond the limits of the perceptron are required.
7 Outlook



The performance of different feed-forward neural network architectures under
the constraint of learning in polynomial time has been the subject of this
study. The main results are the following: For the simple perceptron and the
coding machine learning uncorrelated random patterns, the full capacity
$\alpha_c$ can be exhausted by a polynomial learning rule. For a perceptron
with binary weights, error-free polynomial learning is not possible. The finite
size scaling inferred from simulations indicates that the full capacity
$\alpha_c \approx 0.833$ obtained from replica theory can only be reached
allowing for learning times $\sim e^N$. A mean field analysis of this
architecture yields a discontinuous ergodicity breaking transition in a region
where the replica symmetric solution is stable.

For the committee machine with nonoverlapping receptive fields a continuous
transition shows up for $\alpha > \alpha_p$, and because of the diverging time
scales at the transition, $\alpha_p$ is a bound on the maximal capacity which
can be reached by polynomial learning. This bound is well below the
corresponding value of a simple perceptron of the same size, and the value
$\alpha_c$ obtained from replica theory with one-step replica symmetry breaking
is even higher. The dynamic transition coincides, however, with the onset of
replica symmetry breaking. From simulations, storage capacities well above
$\alpha_p$ were deduced. Preliminary simulations evaluating the probability of
perfect learning as a function of learning time indicate that this discrepancy
is due to subtleties in the evaluation of the numerical data.

We have not addressed the most interesting question of generalization ability 
for patterns created by some rule, for instance a teacher neural network. If teacher 
and student have the same architecture the situation is quite different, but for 
different architectures or for a restricted training set the present discussion might 
be of relevance. 